Evaluation of the stochastic morphosyntactic language model on a one million word Hungarian dictation task
Authors
Abstract
In this article we evaluate our stochastic morphosyntactic language model (SMLM) on a Hungarian newspaper dictation task that requires modeling over 1 million different word forms. The proposed method uses morphemes as the basic recognition units and combines a morpheme N-gram model with a morphosyntactic language model. The architecture of the recognition system is based on the weighted finite-state transducer (WFST) paradigm. Thanks to the flexible transducer-based architecture, the morphosyntactic component integrates seamlessly with the basic modules, with no need to modify the decoder itself. We compare the phoneme, morpheme, and word error rates, as well as the sizes of the recognition networks, in two configurations: one uses only the N-gram model, while the other uses the combined model. The proposed stochastic morphosyntactic language model reduces the morpheme error rate by 1.7% to 7.2% relative to the baseline trigram system. The morpheme error rate of the best configuration is 18%, and the best word error rate is 22.3%.
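The error-rate figures in the abstract can be illustrated with a minimal sketch. It assumes the usual definition of an error rate as the Levenshtein distance between the reference and hypothesis token sequences divided by the reference length; the abstract does not describe its scoring procedure, so this is an assumption. The function names and the toy morpheme segmentation are hypothetical and do not come from the article.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Token error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative decrease of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


# Hypothetical morpheme-level example (illustrative segmentation only).
reference = ["ház", "ak", "ban"]
hypothesis = ["ház", "ban"]
mer = error_rate(reference, hypothesis)  # one deletion over three morphemes
```

The same `error_rate` function applies to phonemes, morphemes, or words; only the tokenization of the two sequences changes. A relative reduction compares two such rates, so a drop from a baseline rate to a lower one is expressed as a fraction of the baseline rather than as a difference in percentage points.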
Similar resources
Finite-state transducer based modeling of morphosyntax with applications to Hungarian LVCSR
This article introduces a novel approach to model morphosyntax in morpheme unit based speech recognizers. The proposed method is evaluated in our recent Hungarian large vocabulary continuous speech recognition (LVCSR) system. The architecture of the recognition system is based on the weighted finite state transducer (WFST) paradigm. The task domain is the recognition of fluently read sentences ...
Finite-state Transducer Based Phonology and Morphology Modeling with Applications to Hungarian LVCSR
This article introduces a novel approach to model phonology and morphosyntax in morpheme unit based speech recognizers. The proposed method is evaluated in our recent Hungarian large vocabulary continuous speech recognition (LVCSR) system. The architecture of the recognition system is based on the weighted finite state transducer (WFST) paradigm. The task domain is the recognition of fluently r...
Modeling Morphosyntax with Finite-state Transducers and Its Application to Hungarian LVCSR
Large vocabulary speech recognition systems for several languages have to use morphemes as the basic recognition units. Such systems frequently suffer from the over-generation property of the smoothed N-gram language model. The source of the problem is that most of the function morphemes are very short and their unigram likelihood is high. These morphemes are inserted frequently in the ...
Development of a Hungarian Medical Dictation System
This paper reviews the current state of a Hungarian project which seeks to create a speech recognition system for the dictation of thyroid gland medical reports. First, we present the MRBA speech corpus that was assembled to support the training of general-purpose Hungarian speech recognition systems. Then we describe the processing of medical reports that were collected to help the creation of...
Combining Morphosyntactic Enriched Representation with n-best Reranking in Statistical Translation
The purpose of this work is to explore the integration of morphosyntactic information into the translation model itself, by enriching words with their morphosyntactic categories. We investigate word disambiguation using morphosyntactic categories, n-best hypotheses reranking, and the combination of both methods with word or morphosyntactic n-gram language model reranking. Experiments are carrie...